Abstract

The need for optimal target detection arises in many different fields. Due to the
complexity of many targets, it is thought that the combination of multiple classification
systems, which can be tuned to several individual target attributes or features, might lead
to better target detection performance. The ROC curves of fused independent
two-label classification systems may be generated by the mathematical combination of
their ROC curves to achieve optimal classifier performance without the need to test every
Boolean combination. The monotonic combination of two-label independent classification
systems which assign labels to the same target types results in a lattice of ROC curves
which is epimorphic to the corresponding combinations of classification systems. Provided
the ROC curves of individual systems are available, testing the lattice of ROC curves in
software with existing individual ROC curves can yield significant cost savings in
the design of optimal classification systems.
THE ROC CURVES OF FUSED INDEPENDENT CLASSIFICATION SYSTEMS


I.  Introduction 

1.1  Motivation 

The  need  to  detect  the  presence  of  a  target  in  temporal,  spatial,  or  spectral  settings 
arises  in  many  fields  of  study;  in  medicine,  the  detection  of  a  cancer;  in  marketing,  the 
detection  of  the  best  customer  base;  in  the  national  command  structure,  the  detection  of 
military  targets  in  a  theater  of  operation.  The  process  of  labeling  or  classifying  a  target 
typically  begins  with  a  sensor  which  detects  certain  attributes,  generating  raw  data.  The 
data  might  need  further  processing  to  allow  for  the  extraction  of  desired  features,  which 
may not be directly measurable. Once the criteria necessary for making a decision about the
presence of a target are obtained, one can label or classify the targets and non-targets.
Since  most  targets  are  composed  of  many  parts,  it  may  be  necessary  to  detect  multiple 
attributes  prior  to  accurately  assigning  the  target  label.  Hence  it  is  often  thought  that  the 
combination  of  multiple  classification  systems,  which  may  use  the  same  or  diverse  feature 
sets,  gives  more  accurate  and  reliable  information  than  the  use  of  a  single  classification 
system. 

1.2  Problem  Statement 

In  two-class  scenarios  the  combination  of  multiple  classification  systems  may  be  done 
in many different ways. What is of interest are combinations yielding the best true positive
rates while keeping the false alarm rate below acceptable thresholds. Since we are
investigating  two-label  classification  systems,  it  makes  the  most  sense  to  use  Boolean  rules, 
thereby  leveraging  all  that  is  known  with  regard  to  Boolean  Algebras  towards  our  field  of 
target  classification.  In  fact,  if  we  consider  the  whole  set,  the  empty  set,  the  meet,  join, 
and  complement  of  every  Boolean  rule,  we  are  indeed  generating  a  Boolean  Algebra  of 
Classification  System  families.  How  well  a  particular  Boolean  combination  performs  can 
be quantified by using what is known as a Receiver Operating Characteristic (ROC) curve,
which was originally developed to analyze radar signals and employed in signal detection
theory [17]. For one to consider any fusion rule other than Boolean, one should be convinced
that it performs better than or equal to any combination in our Boolean Algebra.
When considering the Boolean combination of multiple classification systems, one would
be most interested in finding a combination of Boolean operations on the classification systems
which yields optimal performance without having the need to test each combination.
This way a substantial cost savings could be realized.

1.3  Scope 

For  the  purpose  of  this  thesis  we  will  investigate  the  combination  of  multiple  two- 
label  independent  classification  systems.  This  is  sometimes  referred  to  as  decision  fusion 
or  label  fusion,  not  to  be  confused  with  data  fusion,  or  feature  fusion,  both  of  which  may 
occur  earlier  in  the  target  detection  process.  We  will  restrict  our  attention  to  classifiers 
which  assign  the  same  target  labels.  Classifiers  which  assign  labels  at  different  levels  in  the 
same  genera  may  be  combined  as  well,  but  this  is  known  as  hierarchical  fusion.  We  will  not 
consider  classifiers  which  have  more  than  two  labels  (e.g.  friend,  foe,  unknown)  since  the 
mathematics  and  transforms  to  handle  these  are  beyond  the  capability  of  the  ROC  curve 
to  represent.  Independent  classification  systems  will  be  considered,  while  correlated  ones 
will  be  avoided  to  keep  the  manipulation  of  conditional  probabilities  manageable.  Once  a 
target  has  been  classified,  further  refinements  might  be  made,  which  can  be  grouped  under 
what  is  known  as  target  identification. 
II. Mathematical Background


This  chapter  will  discuss  the  mathematical  theory  needed  for  the  main  results. 

2.1  Philosophical  Background 

A  classifier,  which  assigns  a  binary  label  (true/false  or  target/no  target),  does  so 
based  upon  information  provided  to  it,  and  in  reference  to  a  model.  The  classifier  model 
contains a set of attributes or features germane to the target. The data which is fed to a
classification system may be thought of as a combination of noise and signal. As the data
is processed through algorithms and filters, features are extracted from it. The features
or attributes present in the data set are matched against a set of criteria from the model
feature set. Each feature-based criterion may be a threshold value (real valued), a binary
state (integer/discrete), an m-ary state (e.g. radio button), and the like. Complicated
targets having multiple attributes may require more in-depth classifier models. Those attributes
which are directly measurable are called data, while those which require processing
to  extract  are  called  features.  Ultimately  all  of  the  attributes  necessary  are  assembled  to 
compare  with  a  classifier  model  to  make  the  classification.  For  example,  a  fingerprint  from 
the  right  index  finger  is  sufficient  to  identify  every  living  person,  and  that  is  based  upon  a 
set  of  attributes  resident  in  that  one  fingerprint.  In  most  situations,  several  attributes  are 
compared  with  the  model  prior  to  a  classification.  In  our  research,  we  will  restrict  ourselves 
to  single  attribute  based  classifiers.  This  simplifies  the  math  insofar  as  we  can  look  at  one 
attribute  at  a  time,  and  if  the  attribute  is  real  valued  the  threshold  in  the  model  may  be 
varied  to  allow  a  look  at  true  positive  versus  false  positive  rates  as  a  function  of  the  varied 
threshold,  which  we  will  call  the  classifier’s  parameter. 

2.2 Types of Fusion

There  are  many  ways  of  fusing  outputs  from  multiple  classifiers.  Depending  on  user 
requirements,  fusion  may  occur  at  the  data,  feature,  or  label  phases  of  the  classification 
process.  Our  focus  will  be  the  fusion  of  multiple  labels  to  generate  a  total  label. 

In many applications, data from multiple sensors are pipelined to a common processor.
These data may represent different aspects of an event. It is not as common to find
systems  which  fuse  identical  data  sets  unless  a  further  requirement  of  double,  triple,  or 
quadruple  redundancy  is  imposed.  Once  the  data  is  in  a  common  bucket,  whether  that 
be  spatial,  temporal,  or  spectral,  it  may  be  processed  through  necessary  algorithms  to 
extract  desired  features.  Often  it  is  most  economical  to  get  disparate  data  from  multiple 
sensors  to  a  common  CPU,  where  one  can  focus  all  of  one’s  algorithmic  development  on 
one  processor,  and  leave  the  individual  sensors  less  sophisticated  and  less  costly. 

Extracting attributes of interest often narrows the amount of data one has to carry
around prior to target classification. If one can obtain the principal components of a
matrix of data, in most cases, one can eliminate the components which have the least
significance, reducing the size of the model [4]. Transmission of data also plays into this
game, since the size of the data pipe might require an intelligent shrinking of the data set
so as not to saturate the pipeline. For example, as satellite-borne sensors move towards
hyperspectral data collections, with ever increasing data sets, the need to perform onboard
feature extraction and feature fusion prior to transmission becomes paramount [14].

The  definition  of  label  fusion  is  the  combination  of  classifier  labels  after  the  target / no 
target  assignment  has  been  made.  The  nice  characteristic  about  this  type  of  fusion  is  that 
the  amount  of  data  to  be  handled  is  quite  small.  In  this  research,  which  is  restricted  to 
signature  classifiers  with  one  parameter  and  a  label  space  which  is  common  among  the 
classifiers,  we  have  the  ability  of  capturing  the  performance  of  each  classifier  as  a  function 
of  its  parameter  with  a  ROC  curve.  If  a  classifier  were  to  have  multiple  parameters, 
one  could  still  generate  (many)  ROC  curves  by  keeping  all  other  parameters  fixed  while 
allowing one parameter to vary. From the collection of ROC curves one could choose
the  piecewise  continuous  frontier  at  each  false  positive  threshold,  saving  the  values  of  the 
parameters  which  yielded  each  point  on  the  ROC  curve  frontier. 

2.3  Example  Fusion  Scenarios 

The following diagrams show various ways classification systems may be formed and
fused.

Figure 2.1: Label Fusion of Multiple Classification Systems.

Figure 2.2: Label Fusion After Data Fusion Has Occurred.

Figure 2.3: Label Fusion After Feature Fusion Has Occurred.

2.4 Classification System Theory

The following mathematical treatment is attributed to Schubert, Oxley, and Bauer
[13].

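Let $\mathcal{E}$ be a population set of outcomes. Let $\mathfrak{E}$ be a $\sigma$-algebra of subsets of $\mathcal{E}$; then
$(\mathcal{E}, \mathfrak{E})$ is a measurable space [10]. Let $P$ be a probability measure defined on $\mathfrak{E}$; then
$(\mathcal{E}, \mathfrak{E}, P)$ is a probability measure space [10]. Let $s$ be a sensor that produces data as its
output, i.e., $s$ is a mapping of outcomes from the population set $\mathcal{E}$ to a datum. Let $\mathcal{D}$
denote the data set. Then we write $s : \mathcal{E} \to \mathcal{D}$ or its diagram $\mathcal{E} \xrightarrow{s} \mathcal{D}$. Let $\mathfrak{D}$ be a
$\sigma$-algebra of subsets of $\mathcal{D}$; then $(\mathcal{D}, \mathfrak{D})$ is a measurable space [2]. A mapping, $p$, defined on $\mathcal{D}$
is used to produce an element $x$, called a feature. Let the mapping $p$ represent a processor
that takes a datum from $\mathcal{D}$ and produces a feature, i.e., $p : \mathcal{D} \to \mathcal{F}$ or its diagram
$\mathcal{D} \xrightarrow{p} \mathcal{F}$. Since $x$ is typically a vector of real numbers, $\mathcal{F} \subseteq \mathbb{R}^N$ for some positive integer
$N$. Let $\mathfrak{F}$ be a $\sigma$-algebra of subsets of $\mathcal{F}$; then $(\mathcal{F}, \mathfrak{F})$ is a measurable space. Let $\Theta$
be a threshold set (or a set of parameters); typically, $\Theta = [0,1]$ or $\Theta = \mathbb{R} = (-\infty, \infty)$.
For each $\theta \in \Theta$ let $a_\theta$ be a classifier mapping $\mathcal{F}$ into a label set $\mathcal{L}$. That is, $a_\theta : \mathcal{F} \to \mathcal{L}$
or $\mathcal{F} \xrightarrow{a_\theta} \mathcal{L}$ for each $\theta \in \Theta$. Thus, assume $(\mathcal{L}, \mathfrak{L})$ is a measurable space where $\mathfrak{L}$ is the
power set of $\mathcal{L}$. For a two-class problem, examples of a label set could be $\mathcal{L} = \{\text{true}, \text{false}\}$,
$\mathcal{L} = \{T, F\}$, $\mathcal{L} = \{0, 1\}$ or even $\mathcal{L} = \{\text{target}, \text{non-target}\}$. For some classifiers the
label set is a continuum, e.g., $\mathcal{L} = \mathbb{R}$. In this thesis, $\mathcal{L} = \{t, n\}$ where $t$ = “target” and
$n$ = “non-target”. The simple graphical representation of these mappings is given in the
following diagram.

$$\mathcal{E} \xrightarrow{\;s\;} \mathcal{D} \xrightarrow{\;p\;} \mathcal{F} \xrightarrow{\;a_\theta\;} \mathcal{L}.$$

Define the system $A_\theta$ to be the composition of these mappings for each $\theta \in \Theta$. That is,
for each $\theta \in \Theta$, $A_\theta = a_\theta \circ p \circ s$. Graphically, the diagram for the system is written as

$$\mathcal{E} \xrightarrow{\;A_\theta\;} \mathcal{L}.$$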
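
The composition of sensor, processor, and classifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy sensor, the mean-value processor, and the "label t when the feature reaches theta" classifier are hypothetical stand-ins and are not taken from [13].

```python
from typing import Callable, List


def make_system(sensor: Callable, processor: Callable, classifier: Callable) -> Callable:
    """Return the composition classifier(processor(sensor(e))), i.e. A_theta = a_theta o p o s."""
    def system(outcome):
        return classifier(processor(sensor(outcome)))
    return system


def toy_sensor(outcome: float) -> List[float]:
    # Hypothetical sensor s: three readings spread around the outcome value.
    return [outcome - 0.1, outcome, outcome + 0.1]


def toy_processor(datum: List[float]) -> float:
    # Hypothetical processor p: the feature is the mean of the readings.
    return sum(datum) / len(datum)


def threshold_classifier(theta: float) -> Callable[[float], str]:
    # Hypothetical classifier a_theta: label "t" when the feature reaches theta.
    def a_theta(feature: float) -> str:
        return "t" if feature >= theta else "n"
    return a_theta


# One member of the family {A_theta}: the system with theta = 0.5.
A_half = make_system(toy_sensor, toy_processor, threshold_classifier(0.5))
print(A_half(0.8), A_half(0.2))
```

Varying `theta` while keeping the sensor and processor fixed produces the family of systems whose performance is traced out by the ROC curve.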


The Receiver Operating Characteristic (ROC) curve depicts the trade-off between
true positives and false positives for every allowable threshold of the classifier. The performance
functional (which will be defined later), when applied to the ROC curve, gives
the classification analyst something by which to measure the goodness of a classifier. The
following mathematical development introduces the ROC curve.

Each  mapping  in  the  classification  system,  as  well  as  the  composition  of  mappings, 
has  a  pre-image.  The  general  definition  of  a  pre-image  follows  [9]: 

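Definition 1. (Pre-image) Let $\mathcal{X}$ and $\mathcal{Y}$ be nonempty sets. Let the mapping $f$ take an
element from $\mathcal{X}$ and map it into $\mathcal{Y}$, that is, $f : \mathcal{X} \to \mathcal{Y}$. Given a subset $Y \subseteq \mathcal{Y}$ we define
the pre-image of $Y$ under $f$ to be the subset of $\mathcal{X}$ given by

$$f^{\natural}(Y) = \{x \in \mathcal{X} : f(x) \in Y\}.$$

Thus, the pre-image of a subset $Y$ in $\mathcal{Y}$ is all the elements in $\mathcal{X}$ that are mapped by $f$ into
$Y$.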
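
For finite sets the pre-image can be computed directly from the definition. The following is a minimal sketch; the helper name `preimage` and the squaring map are illustrative choices, picked because the map is not invertible yet its pre-images are still well defined.

```python
from typing import Callable, Hashable, Iterable, Set


def preimage(f: Callable, domain: Iterable[Hashable], subset: Set) -> Set:
    """All elements x of the domain with f(x) in the given subset of the codomain."""
    return {x for x in domain if f(x) in subset}


def square(x: int) -> int:
    # Not one-to-one on this domain, so it has no inverse function.
    return x * x


X = {-2, -1, 0, 1, 2}
print(preimage(square, X, {1, 4}))
print(preimage(square, X, {3}))
```

The first call collects both signs of each root, and the second returns the empty set because no element of the domain maps to 3.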

The pre-image is sometimes called the inverse image [2], although the mapping $f$
need not be invertible, yet the superscript $-1$ is used. Because this construction creates
a natural mapping from subsets of $\mathcal{Y}$ into subsets of $\mathcal{X}$, the natural symbol $\natural$ will be used
instead of $-1$. Therefore, we write $f^{\natural}(Y) = X$. If we consider the entire classification
system as a composition of mappings, then we can write the pre-image of a specific label,
$\ell \in \mathcal{L} = \{\ell_1, \ldots, \ell_n\}$, produced by the classification system $A_\theta$. Let $\mathcal{L}_{\ell_i} = \{\ell_i\}$ so that
$\mathcal{L} = \mathcal{L}_{\ell_1} \cup \mathcal{L}_{\ell_2} \cup \cdots \cup \mathcal{L}_{\ell_n}$; then $A_\theta^{\natural}(\mathcal{L}_{\ell}) = \{e \in \mathcal{E} : A_\theta(e) \in \mathcal{L}_{\ell}\}$. The use of pre-images allows us
to take the resulting labels and express these in terms of the underlying probabilities [9].

Assume the label set is $\mathcal{L} = \{t, n\}$ where $t$ and $n$ may be real values or symbols and
the label $t$ represents a “target” and the label $n$ represents a “non-target”. Define $\mathcal{L}_t = \{t\}$
and $\mathcal{L}_n = \{n\}$. The event set $\mathcal{E}$ can be partitioned into a target event set containing all
target outcomes and a non-target event set containing non-target outcomes. Denote the
true target event set as $\mathcal{E}_t$ and the true non-target event set as $\mathcal{E}_n$. Thus, $\mathcal{E} = \mathcal{E}_t \cup \mathcal{E}_n$ and
$\mathcal{E}_t \cap \mathcal{E}_n = \emptyset$.

In order to quantify how well the classification system $A_\theta$ performs, we appeal to
the probability measure space $(\mathcal{E}, \mathfrak{E}, P)$ to compute the following four performance quantifiers.
Let $P_{TP}(A_\theta)$ denote the probability of true positive classification of the classification
system $A_\theta$. Then $P_{TP}(A_\theta)$ is the probability that the classification system $A_\theta$ labels an
outcome, $e$, as a target label, $t$, given that the outcome really is a target outcome from the
target event set, $\mathcal{E}_t$. Mathematically, $P_{TP}(A_\theta)$ is defined by the conditional probability

$$P_{TP}(A_\theta) = P\{A_\theta(e) = t \mid e \in \mathcal{E}_t\} = \frac{P(A_\theta^{\natural}(\mathcal{L}_t) \cap \mathcal{E}_t)}{P(\mathcal{E}_t)}.$$

Let $P_{FP}(A_\theta)$ denote the probability of false positive classification of the system $A_\theta$. Then
$P_{FP}(A_\theta)$ is the probability that the classification system $A_\theta$ labels an event outcome, $e$,
as a target label, $t$, given that the outcome is really a non-target from the non-target set
of the event set, $\mathcal{E}_n$. This is Type II error [7]. Mathematically, $P_{FP}(A_\theta)$ is defined by the
conditional probability

$$P_{FP}(A_\theta) = P\{A_\theta(e) = t \mid e \in \mathcal{E}_n\} = \frac{P(A_\theta^{\natural}(\mathcal{L}_t) \cap \mathcal{E}_n)}{P(\mathcal{E}_n)}.$$

Let $P_{TN}(A_\theta)$ denote the probability of true negative classification of the system $A_\theta$. Then
$P_{TN}(A_\theta)$ is the probability that the classification system $A_\theta$ labels an event outcome,
$e$, as a non-target label, $n$, given that the outcome really is a non-target outcome from
the non-target event set, $\mathcal{E}_n$. Mathematically, $P_{TN}(A_\theta)$ is defined by the conditional
probability

$$P_{TN}(A_\theta) = P\{A_\theta(e) = n \mid e \in \mathcal{E}_n\} = \frac{P(A_\theta^{\natural}(\mathcal{L}_n) \cap \mathcal{E}_n)}{P(\mathcal{E}_n)}.$$

Let $P_{FN}(A_\theta)$ denote the probability of false negative classification by the system $A_\theta$. Then
$P_{FN}(A_\theta)$ is the probability that the classification system $A_\theta$ labels an event outcome, $e$,
as a non-target label, $n$, given that the outcome is really a target outcome from the target
event set, $\mathcal{E}_t$. This is known as Type I error [7]. Mathematically, $P_{FN}(A_\theta)$ is defined by
the conditional probability

$$P_{FN}(A_\theta) = P\{A_\theta(e) = n \mid e \in \mathcal{E}_t\} = \frac{P(A_\theta^{\natural}(\mathcal{L}_n) \cap \mathcal{E}_t)}{P(\mathcal{E}_t)}.$$


Note that each of these four probabilities is dependent on the threshold value, $\theta$. A
single value for each of these probabilities is computed for each value of $\theta$. As the value
of $\theta$ changes, so do the values of $P_{TP}(A_\theta)$, $P_{FP}(A_\theta)$, $P_{TN}(A_\theta)$ and $P_{FN}(A_\theta)$. A good
illustration of these probabilities is found in Figure 2.4.

Figure 2.4: Typical Type I and Type II Errors (horizontal axis: Theta).
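
The dependence of the four quantifiers on the threshold can be made concrete with empirical frequencies. The sketch below is an assumption-laden illustration: it estimates each conditional probability by counting over a small labeled sample, and it assumes a scalar feature with the hypothetical decision rule "label t when the score reaches theta". The function name and the sample data are invented for the example.

```python
from typing import Dict, Sequence


def performance_quantifiers(scores: Sequence[float],
                            is_target: Sequence[bool],
                            theta: float) -> Dict[str, float]:
    """Empirical P_TP, P_FN, P_FP, P_TN at threshold theta."""
    tp = fp = tn = fn = 0
    for score, target in zip(scores, is_target):
        labeled_target = score >= theta  # assumed decision rule
        if target and labeled_target:
            tp += 1
        elif target:
            fn += 1
        elif labeled_target:
            fp += 1
        else:
            tn += 1
    n_targets, n_nontargets = tp + fn, fp + tn
    if n_targets == 0 or n_nontargets == 0:
        raise ValueError("both target and non-target outcomes are required")
    return {
        "P_TP": tp / n_targets,     # conditioned on the target event set
        "P_FN": fn / n_targets,
        "P_FP": fp / n_nontargets,  # conditioned on the non-target event set
        "P_TN": tn / n_nontargets,
    }


scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
is_target = [True, True, True, False, False, False]
for theta in (0.25, 0.5, 0.75):
    print(theta, performance_quantifiers(scores, is_target, theta))
```

Because each pair is conditioned on the same event set, P_TP + P_FN = 1 and P_FP + P_TN = 1 at every threshold, which is why the pair (P_FP, P_TP) suffices to describe the system.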

Let $\Theta$ be a set of possible thresholds and $\mathbb{A} = \{A_\theta : \theta \in \Theta\}$ the corresponding family of
systems. Define the set of triples

$$\tau_{\mathbb{A}} = \{(\theta, P_{FP}(A_\theta), P_{TP}(A_\theta)) : \theta \in \Theta\}$$

to be the trajectory of $\mathbb{A}$ [9], [13]. We can project this trajectory onto the second and third
components to yield the set

$$f_{\mathbb{A}} = \{(P_{FP}(A_\theta), P_{TP}(A_\theta)) : \theta \in \Theta\}.$$

If $\Theta$ is homeomorphic to the real numbers $\mathbb{R}$, then the trajectory $\tau_{\mathbb{A}}$ will be a curve
in $\mathbb{R}^3$ and the projection $f_{\mathbb{A}}$ will be a curve in $\mathbb{R}^2$ (more specifically, a curve in the unit square
$[0,1] \times [0,1]$). Formally, this curve is called the ROC curve for the system family $\mathbb{A}$. For
the case when $\Theta$ is discrete, the ROC “curve” is a set of discrete points. An example of
this projection is given in Figure 2.5.

Figure 2.5: A ROC trajectory and its projection.

If $\Theta$ is a multi-dimensional set then this analysis will not yield a single curve in the
$P_{FP}$-$P_{TP}$ plane. Instead, a collection of curves is created. Therefore, we choose the upper
frontier to be the ROC curve as representative of the classifier performance.

Definition 2. (ROC function, ROC curve) Let $\mathbb{A} = \{A_\theta : \theta \in \Theta\}$ be a family of classification
systems defined on the probability space $(\mathcal{E}, \mathfrak{E}, P)$ mapping to the label set $\mathcal{L} = \{t, n\}$
with parameter set $\Theta$. For each $p \in [0,1]$, define the set

$$\Theta_p = \{\theta \in \Theta : P_{FP}(A_\theta) \le p\}.$$

For $p \in [0,1]$, if $\Theta_p$ is nonempty then define

$$f_{\mathbb{A}}(p) = \max\{P_{TP}(A_\theta) : \theta \in \Theta_p\}. \tag{2.1}$$

If $\Theta_p$ is empty then $f_{\mathbb{A}}(p)$ is not defined. The function $f_{\mathbb{A}}$ is called the ROC function. The
graph of $f_{\mathbb{A}}$ is called the ROC curve [9].
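
When the parameter set is finite, the ROC function of Definition 2 reduces to a filtered maximum. The sketch below assumes the family is already summarized as a list of (P_FP, P_TP) operating points, one per parameter value; the name `roc_function` and the sample points are illustrative. It returns `None` where the set of admissible parameters is empty, mirroring the "not defined" case.

```python
from typing import Iterable, Optional, Tuple


def roc_function(points: Iterable[Tuple[float, float]], p: float) -> Optional[float]:
    """Largest P_TP among operating points whose P_FP does not exceed p."""
    feasible = [tp for fp, tp in points if fp <= p]
    return max(feasible) if feasible else None


# Hypothetical operating points; (0.3, 0.4) lies below the upper frontier.
points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.4), (0.5, 0.9), (1.0, 1.0)]
for p in (0.05, 0.3, 0.5):
    print(p, roc_function(points, p))
```

The dominated point (0.3, 0.4) never determines the value, so the same routine also extracts the upper frontier when several curves are pooled into one point list. Enlarging p can only enlarge the admissible set, so the returned value never decreases.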

In practice, the set $\Theta_p$ may be empty for certain values of $p$. We avoid the discussion
of this case and assume that the ROC function is defined for all $p \in [0,1]$. We make this
clear by defining a total ROC function.

The set of total ROC functions may be defined as:

$$\mathscr{F} = \{f : [0,1] \to [0,1] \mid f \text{ is non-decreasing on } [0,1]\}.$$

A property of a total ROC curve is given in the following theorem [9].

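Theorem 1. Let $\mathbb{A} = \{A_\theta : \theta \in \Theta\}$ be a family of classification systems. Then $f_{\mathbb{A}}$ is a
non-decreasing function. That is, for every $p, q \in [0,1]$ with $p \le q$, $f_{\mathbb{A}}(p) \le f_{\mathbb{A}}(q)$.

Proof. Let $p, q \in [0,1]$ with $p \le q$; then $\Theta_p \subseteq \Theta_q$, therefore

$$f_{\mathbb{A}}(p) = \max_{\theta \in \Theta_p} P_{TP}(A_\theta) \le \max_{\theta \in \Theta_q} P_{TP}(A_\theta) = f_{\mathbb{A}}(q).$$

□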

Definition 3. (Set of total ROC functions) Let the set of total ROC functions be denoted
by

$$\mathscr{F} = \{f : [0,1] \to [0,1] \mid f \text{ is non-decreasing on } [0,1]\}.$$

Notice that we do not require the functions to be continuous.

We write $f = g$ to mean the point-wise equality, that is, $f(p) = g(p)$ for all $p \in [0,1]$.

2.5  Performance  Measures 

The ROC curve depicts the trade-off between true positives and false positives for
every allowable threshold of the classifier. A performance functional, when applied to the
ROC curve, gives the classification analyst something by which to measure the goodness of
a classifier. For example, if one pays attention to the upper left corner of the graph, ROC
heaven as it were, one might be tempted to grade a ROC curve on how closely it approaches
this corner [3]. One might take the distance of closest approach of the ROC
curve to the corner, in the Euclidean norm or the one norm, as a measure of how well the
system is performing. In the Neyman-Pearson method, a false alarm rate is specified, and
the optimal performance is found by the fusion combination that maximizes the vertical
distance above the chance line [8]. The chance line joins the vertices (0,0) and (1,1) in the
ROC curve diagram. The Bayesian Risk of a ROC curve can be bounded by a line through
the ROC heaven corner, whose slope depends upon the weighting between the cost of false
negative versus the cost of false positive. If the costs are equal, then the line will be parallel
to the chance line (45 deg), and the minimum cost is found by the tangent to the ROC
curve that is parallel to the chance line. As the weights are adjusted, the tangent line
tracks the angle of the Bayesian Risk bound [5]. Another measure of goodness is the area
under the curve (AUC). While this does reward good performers, there are many cases
where two ROC curves achieve the same score even though one is clearly better because
it lies further to the left of the chart than its "equal" counterpart. The
Summary Receiver Operating Characteristic (SROC) curve represents another performance
measure which has seen increased emphasis in the statistical community [11].
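
Three of the measures mentioned above are easy to evaluate on a sampled ROC curve. The sketch below is illustrative only: the function names and operating points are invented, the AUC uses trapezoidal integration between sampled points (an assumption, since a ROC function need not be piecewise linear), and the Neyman-Pearson value is read as the best P_TP whose P_FP stays within the specified false alarm rate.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (P_FP, P_TP)


def closest_approach(points: List[Point], norm: int = 2) -> float:
    """Smallest distance from any operating point to ROC heaven (0, 1)."""
    if norm == 2:
        return min(math.hypot(fp, 1.0 - tp) for fp, tp in points)
    if norm == 1:
        return min(abs(fp) + abs(1.0 - tp) for fp, tp in points)
    raise ValueError("norm must be 1 or 2")


def auc_trapezoid(points: List[Point]) -> float:
    """Area under the curve, joining consecutive points with straight segments."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def neyman_pearson(points: List[Point], max_false_alarm: float) -> Optional[float]:
    """Best P_TP among points with P_FP at or below the specified rate."""
    feasible = [tp for fp, tp in points if fp <= max_false_alarm]
    return max(feasible) if feasible else None


curve = [(0.0, 0.0), (0.2, 0.6), (0.5, 0.9), (1.0, 1.0)]
print(closest_approach(curve, 2), closest_approach(curve, 1))
print(auc_trapezoid(curve))
print(neyman_pearson(curve, 0.3))
```

On this sample the one norm cannot distinguish the two interior points, while the Euclidean norm prefers (0.2, 0.6); different functionals can therefore rank the same curves differently.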

Consider the case when two sensors, $s_1$ and $s_2$, observe outcomes occurring in the
same population set $\mathcal{E}$. Assume they produce data in data sets $\mathcal{D}_1$ and $\mathcal{D}_2$. That is,
$s_1 : \mathcal{E} \to \mathcal{D}_1$ and $s_2 : \mathcal{E} \to \mathcal{D}_2$. Further, assume sensors $s_1$ and $s_2$ each have a processor, $p_1$
and $p_2$, respectively, which maps data in the respective data sets, $\mathcal{D}_1$ and $\mathcal{D}_2$, to features
in feature sets $\mathcal{F}_1$ and $\mathcal{F}_2$. In particular, assume $p_1 : \mathcal{D}_1 \to \mathcal{F}_1$ and $p_2 : \mathcal{D}_2 \to \mathcal{F}_2$.
Suppose that the family of classifiers for $p_1$ and $s_1$ is given by $\{a_\theta : \theta \in \Theta\}$ and that
the family of classifiers for $p_2$ and $s_2$ is given by another family, $\{b_\phi : \phi \in \Phi\}$. Let
$a_\theta : \mathcal{F}_1 \to \mathcal{L}_1$ for each $\theta \in \Theta$ and $b_\phi : \mathcal{F}_2 \to \mathcal{L}_2$ for each $\phi \in \Phi$. Then the labels
that are produced from each of the classification systems are fused together to create an
overall label for the outcome of interest. The composition of these mappings yields systems
represented by the following diagram.


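$$
\begin{array}{ccccccccc}
 & & \mathcal{D}_1 & \xrightarrow{\;p_1\;} & \mathcal{F}_1 & \xrightarrow{\;a_\theta\;} & \mathcal{L}_1 & & \\
 & \overset{s_1}{\nearrow} & \text{(Data)} & & \text{(Feature)} & & \text{(Label)} & \searrow & \\
\mathcal{E}\ \text{(Event)} & & & & & & & & \mathcal{L}\ \text{(Label)} \\
 & \underset{s_2}{\searrow} & \text{(Data)} & & \text{(Feature)} & & \text{(Label)} & \nearrow & \\
 & & \mathcal{D}_2 & \xrightarrow{\;p_2\;} & \mathcal{F}_2 & \xrightarrow{\;b_\phi\;} & \mathcal{L}_2 & &
\end{array}
$$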

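We will suppress the text to simplify the diagram to the following

$$
\begin{array}{ccccccccc}
 & & \mathcal{D}_1 & \xrightarrow{\;p_1\;} & \mathcal{F}_1 & \xrightarrow{\;a_\theta\;} & \mathcal{L}_1 & & \\
 & \overset{s_1}{\nearrow} & & & & & & \searrow & \\
\mathcal{E} & & & & & & & & \mathcal{L} \\
 & \underset{s_2}{\searrow} & & & & & & \nearrow & \\
 & & \mathcal{D}_2 & \xrightarrow{\;p_2\;} & \mathcal{F}_2 & \xrightarrow{\;b_\phi\;} & \mathcal{L}_2 & &
\end{array}
$$

For these two classification systems the compositions yield the systems $A_\theta = a_\theta \circ p_1 \circ s_1$
for each $\theta \in \Theta$ and $B_\phi = b_\phi \circ p_2 \circ s_2$ for each $\phi \in \Phi$. Thus, the individual diagrams are

$$\mathcal{E} \xrightarrow{\;A_\theta\;} \mathcal{L}_1 \qquad \text{and} \qquad \mathcal{E} \xrightarrow{\;B_\phi\;} \mathcal{L}_2$$

and the two families of classification systems are given by $\mathbb{A} = \{A_\theta : \theta \in \Theta\}$ and
$\mathbb{B} = \{B_\phi : \phi \in \Phi\}$ (see ref. [9]). The two classification systems developed above map outcomes from
the population set into different data, feature, and label sets, which are then used to fuse
the classification systems together.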

There are, however, other ways to label outcomes from the event set. In this paper,
classification systems can map outcomes into either the same or different data sets or the
same or different feature sets. The sets which must remain the same for the mathematical
development contained herein are the event set $\mathcal{E}$ and the two-class label set $\mathcal{L}$. Therefore,
the classification systems must be acting from the same event set, map into either the same
or different data and feature sets and eventually map into the same label set. These labels
are combined together to generate one overall label for that outcome.

In  this  paper,  we  assume  that  the  two  classification  systems  are  independent  or 
uncorrelated.  That  is,  the  occurrence  or  non-occurrence  of  an  event  classified  by  one 
system  will  not  affect  the  occurrence  or  non-occurrence  of  another  event  classified  by  the 
other  system.  This  simplifies  the  expression  of  conditional  probabilities. 

We will also only consider two-label classifiers. That is, the label set for all systems
considered, including each individual system and the fused classification system, contains
two values or two classes. Examples of possible members of this label set were given
previously, but the label set considered here is $\mathcal{L} = \{t, n\}$ where $t$ = “target” and $n$ =
“non-target”.

Using the premises of label fusion, a two-class label system, and classifier independence,
representations for a system that fuses two classification systems are developed.

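The OR rule is a binary operation defined on $\mathcal{L}$. Denote the OR operation by $\vee$. Its
definition is given in the table:

$$
\begin{array}{c|cc}
\vee & t & n \\ \hline
t & t & t \\
n & t & n
\end{array}
$$

Then the new classification system $C^{OR}_{\theta,\phi}$ is defined by the point-wise OR operation

$$C^{OR}_{\theta,\phi}(e) = A_\theta(e) \vee B_\phi(e) \quad \text{for all } e \in \mathcal{E} \tag{2.2}$$

and yields a new classification system family $\mathbb{C}^{OR} = \{C^{OR}_{\theta,\phi} : \theta \in \Theta, \phi \in \Phi\}$. Thus, to be
labeled as “target”, either the label from $A_\theta$ or $B_\phi$ must be the “target” $t$ label [9].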

The AND rule is a binary operation defined on $\mathcal{L}$. We denote this operation by $\wedge$.
Its definition is given in the table:

$$
\begin{array}{c|cc}
\wedge & t & n \\ \hline
t & t & n \\
n & n & n
\end{array}
$$

The new classification system $C^{AND}_{\theta,\phi}$ is defined by the point-wise AND operation on its
output, that is,

$$C^{AND}_{\theta,\phi}(e) = A_\theta(e) \wedge B_\phi(e) \quad \text{for all } e \in \mathcal{E}. \tag{2.3}$$

This produces a new classification system family $\mathbb{C}^{AND} = \{C^{AND}_{\theta,\phi} : \theta \in \Theta, \phi \in \Phi\}$. Thus,
to be labeled as “target”, both the label from $A_\theta$ and $B_\phi$ must be the target “$t$” label [9].
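
Under the independence premise, the operating point of an OR- or AND-fused system can be computed from the operating points of its components. The sketch below rests on an assumption stated loudly: the two systems are taken to be independent conditionally on the true class (target or non-target), so that joint label probabilities factor into products. The function names and numeric values are illustrative.

```python
from typing import Tuple

Point = Tuple[float, float]  # (P_FP, P_TP)


def fuse_or(a: Point, b: Point) -> Point:
    """OR labels 'n' only when both systems label 'n' (independence assumed)."""
    (a_fp, a_tp), (b_fp, b_tp) = a, b
    return (1.0 - (1.0 - a_fp) * (1.0 - b_fp),
            1.0 - (1.0 - a_tp) * (1.0 - b_tp))


def fuse_and(a: Point, b: Point) -> Point:
    """AND labels 't' only when both systems label 't' (independence assumed)."""
    (a_fp, a_tp), (b_fp, b_tp) = a, b
    return (a_fp * b_fp, a_tp * b_tp)


A_point = (0.1, 0.8)  # hypothetical operating point of A_theta
B_point = (0.2, 0.7)  # hypothetical operating point of B_phi
print(fuse_or(A_point, B_point))
print(fuse_and(A_point, B_point))
```

OR raises both the true positive and false positive rates, while AND lowers both, so the two rules move the fused operating point in opposite directions along the ROC plane. Sweeping both parameters traces out the family of fused operating points.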

The NOT rule is a unary operation defined on $\mathcal{L}$. We denote the NOT operation by
$\neg$. Its definition is given in the table:

$$
\begin{array}{c|cc}
\neg & t & n \\ \hline
 & n & t
\end{array}
$$

Then the new classification system $\neg A_\theta$ is defined by the point-wise NOT operation

$$[\neg A_\theta](e) = \neg[A_\theta(e)] \quad \text{for all } e \in \mathcal{E} \tag{2.4}$$

and yields a new classification system family $\mathbb{C}^{NOT} = \{\neg A_\theta : \theta \in \Theta\}$. Thus, to be labeled
as “target”, the label from system $A_\theta$ must be the “non-target” $n$ label. For brevity we
write $\neg\mathbb{A} = \mathbb{C}^{NOT}$. Clearly, the NOT rule is not a fusion rule, but it will be used in certain
situations [9].
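
The three rule tables translate directly into lookup tables applied point-wise to the outputs of the systems. The following is a minimal sketch; the dictionary names, the helpers `fuse` and `negate`, and the two toy systems acting on integer outcomes are hypothetical and serve only to exercise the tables.

```python
from typing import Callable, Dict, Tuple

OR: Dict[Tuple[str, str], str] = {
    ("t", "t"): "t", ("t", "n"): "t", ("n", "t"): "t", ("n", "n"): "n",
}
AND: Dict[Tuple[str, str], str] = {
    ("t", "t"): "t", ("t", "n"): "n", ("n", "t"): "n", ("n", "n"): "n",
}
NOT: Dict[str, str] = {"t": "n", "n": "t"}


def fuse(rule: Dict[Tuple[str, str], str], A: Callable, B: Callable) -> Callable:
    """Point-wise fusion: the fused label at e is rule[(A(e), B(e))]."""
    return lambda e: rule[(A(e), B(e))]


def negate(A: Callable) -> Callable:
    """Point-wise NOT of a single system."""
    return lambda e: NOT[A(e)]


# Hypothetical systems on integer outcomes.
def A(e: int) -> str:
    return "t" if e >= 3 else "n"


def B(e: int) -> str:
    return "t" if e % 2 == 0 else "n"


C_or, C_and, not_A = fuse(OR, A, B), fuse(AND, A, B), negate(A)
for e in (1, 2, 3, 4):
    print(e, A(e), B(e), C_or(e), C_and(e), not_A(e))
```

Both systems must act on the same outcome e and emit labels from the same two-element label set for the lookup to be meaningful, which matches the restrictions placed on the event set and label set.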

A fusion rule is a method of combining multiple classifiers, presumably with the intent
of achieving better performance. Since the outcome of our classifiers is binary
(target/no target), it is reasonable that some Boolean rule might express the optimal
combination of classifiers. It is a well known fact that the total number of possible fusion
rules for $k$ two-label classifiers is $2^{2^k}$ [16]. Figure 2.6 lists all 16 possible
rules for just $k = 2$ binary classifiers.

Figure 2.6: Possible Fusion Rules for Two Binary Decisions [16].


One can demonstrate that, using ANDs, ORs, and NOTs, one can populate or cover
this entire set. But that is not the object here. Many of the rules listed in Figure 2.6
are poor performers. Identifying the best performers without having to evaluate every
possible combination is desirable. Varshney shows that restricting attention to monotonic fusion
rules eliminates the majority of the poor performers. A detailed definition of monotonic
fusion rules is found on pages 63-64 of Varshney [16]. Before proceeding further, we give a formal
definition of fusion. Let $\mathfrak{R}$ be a set of rules:

$$\mathfrak{R} = \{r : \mathcal{L} \times \mathcal{L} \to \mathcal{L}\}.$$

Label fusion with respect to a performance functional $\rho$ is a rule $r^* \in \mathfrak{R}$ satisfying:

$$\rho(r^*(\mathbb{A}, \mathbb{B})) = \max_{r \in \mathfrak{R}} \rho(r(\mathbb{A}, \mathbb{B})) \ge \max\{\rho(\mathbb{A}), \rho(\mathbb{B})\} \tag{2.5}$$

Monotonicity of fusion rules is analogous to the monotonicity of switching functions or
finding the simplest disjunctive normal form for a given truth function [6]. Examples of
poor performers which never escape the chance line are the constant fusion rules which
either always assign the target label, or always the non-target label. When fusing multiple
classification systems, exclusionary rules are not desirable. Suppose we had classification
systems A, B, and C. A rule which says always believe A and disregard systems B
and C is not a fusion rule, since it does not deliver results strictly better than the individual
classification systems in Equation 2.5. If each classifier is doing better than chance, it will
become evident that Boolean meet (AND) and join (OR) are all that is needed to optimize
the label fusion.
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The  focus  of  our  research  will  be  restricted  to  label  fusion.  Further  we  require  that 
each  classifier  being  combined  has  identical  labels.  Since  the  classifiers  being  combined  are 
ordered  towards  the  same  target /no  target  label,  and  due  the  desire  for  monotonic  optimal 
fusion  rules,  we  only  need  consider  the  Boolean  “OR”  and  the  Boolean  “AND” .  Use  of 
the  “NOT”  unary  operator  is  not  necessary  for  optimal  fusion.  There  are  no  exceptions.  In 
the  case  of  the  classifier  that  operates  under  the  chance  line,  a  application  of  the  “NOT” 
to  that  classifier  would  precede  any  attempt  to  fuse  with  other  classifiers. 
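The counting argument of this section can be illustrated with a short sketch. The Python below is an illustration only and is not taken from [16]; the helper names `all_rules`, `is_monotone`, and `count_rules` are hypothetical. It enumerates every Boolean fusion rule on $k$ binary decisions as a truth table and keeps those that are monotone, in the sense that raising any single input from 0 to 1 never lowers the output.

```python
from itertools import product

def all_rules(k):
    """Yield every Boolean fusion rule on k binary decisions as a truth table.

    Each rule maps the 2**k input combinations to an output bit,
    so there are 2**(2**k) rules in total.
    """
    inputs = list(product((0, 1), repeat=k))
    for outputs in product((0, 1), repeat=len(inputs)):
        yield dict(zip(inputs, outputs))

def is_monotone(rule):
    """True if raising any single input from 0 to 1 never lowers the output."""
    for x, out in rule.items():
        for i, bit in enumerate(x):
            if bit == 0:
                raised = x[:i] + (1,) + x[i + 1:]
                if rule[raised] < out:
                    return False
    return True

def count_rules(k):
    """Return (number of all rules, number of monotone rules) for k classifiers."""
    rules = list(all_rules(k))
    monotone = [r for r in rules if is_monotone(r)]
    return len(rules), len(monotone)

if __name__ == "__main__":
    for k in (2, 3):
        total, mono = count_rules(k)
        print(f"k={k}: {total} rules, {mono} monotone")
```

For $k = 2$ the monotone rules found this way are the two constant rules, the two rules that believe a single classifier, and the AND and the OR. Only the last two combine both classifiers, which agrees with the restriction to meet and join made above.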

2.6  Boolean  Algebra 

The  definition  of  a  Boolean  Algebra  is  given  below  [9]  [1] . 

Definition 4. A Boolean Algebra is an algebraic structure, denoted by $(\mathscr{A}, =, \wedge, \vee, \neg)$,
where

$\mathscr{A}$ is a nonempty set of elements;

$=$ denotes element equality;

$\wedge$ is a binary operation on elements in $\mathscr{A}$ called AND, conjunction, or meet;

$\vee$ is a binary operation on elements in $\mathscr{A}$ called OR, disjunction, or join;

$\neg$ is a unary operation on elements in $\mathscr{A}$ called NOT or negation (or complementation).

And the following axioms hold true:

1. $\mathscr{A}$ is closed w.r.t. $\wedge$, $\vee$ and $\neg$. For every $a, b \in \mathscr{A}$

$$a \wedge b \in \mathscr{A} \qquad a \vee b \in \mathscr{A} \qquad \neg a \in \mathscr{A}$$

2. $\mathscr{A}$ is associative w.r.t. $\wedge$ and $\vee$. For every $a, b, c \in \mathscr{A}$

$$(a \wedge b) \wedge c = a \wedge (b \wedge c) \qquad (a \vee b) \vee c = a \vee (b \vee c)$$

3. $\mathscr{A}$ is commutative w.r.t. $\wedge$ and $\vee$. For every $a, b \in \mathscr{A}$

$$a \wedge b = b \wedge a \qquad a \vee b = b \vee a$$

4. $\mathscr{A}$ has unique identities w.r.t. $\wedge$ and $\vee$. There exist unique elements $l, u \in \mathscr{A}$ such
that for every $a \in \mathscr{A}$

$$a \wedge u = a \qquad a \vee l = a$$

5. $\mathscr{A}$ is absorptive w.r.t. $\wedge$ and $\vee$. For every $a, b \in \mathscr{A}$

$$a \wedge (a \vee b) = a \qquad a \vee (a \wedge b) = a$$

6. $\mathscr{A}$ is distributive w.r.t. $\wedge$ and $\vee$. For every $a, b, c \in \mathscr{A}$

$$a \wedge (b \vee c) = (a \wedge b) \vee (a \wedge c)$$
$$a \vee (b \wedge c) = (a \vee b) \wedge (a \vee c)$$

7. $\mathscr{A}$ contains complements w.r.t. $\wedge$ and $\vee$. For every $a \in \mathscr{A}$

$$a \wedge \neg a = l \qquad a \vee \neg a = u$$

There are several other properties that follow from these axioms; see [1] for a larger list.

2.7 Lattice

Definition 5. A Lattice is an algebraic structure, denoted by $(\mathscr{L}, =, \wedge, \vee)$, where

$\mathscr{L}$ is a nonempty set of elements;

$=$ denotes element equality;

$\wedge$ is a binary operation on elements in $\mathscr{L}$ called AND, conjunction, or meet;

$\vee$ is a binary operation on elements in $\mathscr{L}$ called OR, disjunction, or join.

And the following axioms hold true:

1. $\mathscr{L}$ is closed w.r.t. $\wedge$ and $\vee$. For every $a, b \in \mathscr{L}$

$$a \wedge b \in \mathscr{L} \qquad a \vee b \in \mathscr{L}$$

2. $\mathscr{L}$ is associative w.r.t. $\wedge$ and $\vee$. For every $a, b, c \in \mathscr{L}$

$$(a \wedge b) \wedge c = a \wedge (b \wedge c) \qquad (a \vee b) \vee c = a \vee (b \vee c)$$

3. $\mathscr{L}$ is commutative w.r.t. $\wedge$ and $\vee$. For every $a, b \in \mathscr{L}$

$$a \wedge b = b \wedge a \qquad a \vee b = b \vee a$$

4. $\mathscr{L}$ has unique identities w.r.t. $\wedge$ and $\vee$. There exist unique elements $l, u \in \mathscr{L}$ such
that for every $a \in \mathscr{L}$

$$a \wedge u = a \qquad a \vee l = a$$

5. $\mathscr{L}$ is absorptive w.r.t. $\wedge$ and $\vee$. For every $a, b \in \mathscr{L}$

$$a \wedge (a \vee b) = a \qquad a \vee (a \wedge b) = a$$

The lattice contains only combinations of meets and joins, which form the monotonic
subset of all possible combinations of meets, joins, and complements in the Boolean algebra
(see page 41 of [15]). When we investigate the optimum combination of classifiers, we are
only interested in unique monotonic rules [16].

III. Main Results


3.1 Introduction

If it can be established that there exists an epimorphism between a Boolean algebra of
ROC curves and a Boolean algebra of classification system families, then finding the best
combination of classification system families can be done by combining their respective
ROC curves, obviating the need for additional testing of each and every combination.
Given how expensive it might be to generate each datum on a single ROC curve, imagine
how expensive it would become to generate each datum on every ROC curve arising out
of a Boolean algebra. To put this in perspective, for any $k$ two-label classifiers, there are
$2^{2^k}$ possible Boolean combinations [16]:

Number of Systems    Number of Fusion Rules
2                    16
3                    256
4                    65,536

For the purposes of label fusion of identical labels, utilizing ROC curves which all fall
above the chance line, there is no need to include the NOT, as it would be counterproductive
towards improving overall classification performance. Utilizing only the Boolean join and
meet, our Boolean algebra reduces to a lattice [16]. If we can show that the join of two
ROC curves corresponds epimorphically (onto) to the OR (join) of their respective
classification systems, and if we can show the same for the AND (meet), then we can show
it for any finite combination using meets and joins. Since we are interested in finding
the best combination of a finite number of classification systems, it will be good to know
that what we learn from optimizing ROC performance will be equivalent to optimization
of the classification systems.

The first step will be to show that the meet and join of ROC curves are equivalent to
the AND and OR of classification families.

3.2 OR Formula


We will capitalize on the development of Schubert [12], which proves the formula for
the OR of ROC curves. We start with the development of $P_{TP}(A_\theta \vee B_\phi)$ and $P_{FP}(A_\theta \vee B_\phi)$
[12]. Recall that $\mathcal{L}_n = \{n\}$.


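\begin{align*}
P_{TP}(A_\theta \vee B_\phi) &= 1 - P_{FN}(A_\theta \vee B_\phi) \\
&= 1 - P\big([A_\theta \vee B_\phi]^{\natural}(\mathcal{L}_n) \mid \mathcal{E}_t\big) \\
&= 1 - P\big(A_\theta^{\natural}(\mathcal{L}_n) \cap B_\phi^{\natural}(\mathcal{L}_n) \mid \mathcal{E}_t\big) \\
&= 1 - P\big(A_\theta^{\natural}(\mathcal{L}_n) \mid \mathcal{E}_t\big)\, P\big(B_\phi^{\natural}(\mathcal{L}_n) \mid \mathcal{E}_t\big) \quad \text{by independence} \\
&= 1 - P_{FN}(A_\theta)\, P_{FN}(B_\phi) \\
&= 1 - [1 - P_{TP}(A_\theta)]\,[1 - P_{TP}(B_\phi)] \\
&= P_{TP}(A_\theta) + P_{TP}(B_\phi) - P_{TP}(A_\theta)\, P_{TP}(B_\phi). \tag{3.1}
\end{align*}

\begin{align*}
P_{FP}(A_\theta \vee B_\phi) &= 1 - P_{TN}(A_\theta \vee B_\phi) \\
&= 1 - P\big([A_\theta \vee B_\phi]^{\natural}(\mathcal{L}_n) \mid \mathcal{E}_n\big) \\
&= 1 - P\big(A_\theta^{\natural}(\mathcal{L}_n) \cap B_\phi^{\natural}(\mathcal{L}_n) \mid \mathcal{E}_n\big) \\
&= 1 - P\big(A_\theta^{\natural}(\mathcal{L}_n) \mid \mathcal{E}_n\big)\, P\big(B_\phi^{\natural}(\mathcal{L}_n) \mid \mathcal{E}_n\big) \quad \text{by independence} \\
&= 1 - P_{TN}(A_\theta)\, P_{TN}(B_\phi) \\
&= 1 - [1 - P_{FP}(A_\theta)]\,[1 - P_{FP}(B_\phi)] \\
&= P_{FP}(A_\theta) + P_{FP}(B_\phi) - P_{FP}(A_\theta)\, P_{FP}(B_\phi). \tag{3.2}
\end{align*}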
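Equations 3.1 and 3.2 can be checked by brute force. The following Python is a minimal sketch under the independence assumption used above, not part of the development in [12]; the names `fused_rate` and `or_rate` are hypothetical. It sums the probability of every joint decision of the individual classifiers for which a given Boolean rule outputs the target label, and compares the result with the closed form.

```python
from itertools import product

def fused_rate(rule, rates):
    """Probability that `rule` outputs 'target' when each classifier
    independently outputs 'target' with the corresponding probability in `rates`."""
    total = 0.0
    for decisions in product((0, 1), repeat=len(rates)):
        prob = 1.0
        for d, rate in zip(decisions, rates):
            prob *= rate if d else 1.0 - rate   # independence: joint = product
        if rule(*decisions):
            total += prob
    return total

def or_rate(a, b):
    """Closed form shared by Equations 3.1 and 3.2."""
    return a + b - a * b

if __name__ == "__main__":
    tp = (0.8, 0.7)   # illustrative true positive rates, not data from the thesis
    fp = (0.1, 0.2)   # illustrative false positive rates
    print("P_TP:", fused_rate(lambda a, b: a or b, tp), or_rate(*tp))
    print("P_FP:", fused_rate(lambda a, b: a or b, fp), or_rate(*fp))
```

The same `fused_rate` helper accepts any Boolean rule, so passing `lambda a, b: a and b` would give the product of the two rates.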


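Let $r \in [0,1]$ be a value for the probability of false positive for the fused classifier
$C^{OR}_{\theta,\phi} = A_\theta \vee B_\phi$; then $f_{A_\theta \vee B_\phi}(r)$ is the value of the probability of true positive
for classifier $C^{OR}_{\theta,\phi}$. For $p, q, r \in [0,1]$ define

$$\Theta_p = \{\theta \in \Theta : P_{FP}(A_\theta) = p\}$$

$$\Phi_q = \{\phi \in \Phi : P_{FP}(B_\phi) = q\}$$

$$\Psi_r^{OR} = \{(\theta,\phi) \in \Theta \times \Phi : P_{FP}(C^{OR}_{\theta,\phi}) = r\}.$$

Theorem 2. For every $r \in [0,1]$,

$$\Psi_r^{OR} = \bigcup_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Theta_p \times \Phi_q.$$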

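Proof. Choose $r \in [0,1]$ and let it be fixed. Let

$$(\theta', \phi') \in \bigcup_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Theta_p \times \Phi_q;$$

then there exist some $p', q' \in [0,1]$ such that $\theta' \in \Theta_{p'}$, $\phi' \in \Phi_{q'}$, and $p' + q' - p'q' = r$.
From the definitions of $\Theta_p$ and $\Phi_q$ we have

$$p' = P_{FP}(A_{\theta'}), \qquad q' = P_{FP}(B_{\phi'}),$$

which, by Equation 3.2, implies

$$P_{FP}(C^{OR}_{\theta',\phi'}) = P_{FP}(A_{\theta'}) + P_{FP}(B_{\phi'}) - P_{FP}(A_{\theta'})\,P_{FP}(B_{\phi'}) = p' + q' - p'q' = r$$

and so

$$(\theta', \phi') \in \Psi_r^{OR}.$$

Thus,

$$\bigcup_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Theta_p \times \Phi_q \subseteq \Psi_r^{OR}.$$

On the other hand, let $(\theta', \phi') \in \Psi_r^{OR}$; then $P_{FP}(C^{OR}_{\theta',\phi'}) = r$. Observe that $\theta' \in \Theta_{p'}$ for
some $p' \in [0,1]$ and $\phi' \in \Phi_{q'}$ for some $q' \in [0,1]$. We have that $1 - p' \in [0,1]$ and $1 - q' \in [0,1]$,
so that $(1 - p')(1 - q') \in [0,1]$ and therefore

$$1 - (1 - p')(1 - q') \in [0,1].$$

Since

$$1 - (1 - p')(1 - q') = p' + q' - p'q',$$

then

$$p' + q' - p'q' \in [0,1].$$

Thus, by Equation 3.2, there exist real numbers $p'$ and $q'$ in $[0,1]$ such that

$$p' + q' - p'q' = P_{FP}(C^{OR}_{\theta',\phi'}) = r,$$

which implies that

$$(\theta', \phi') \in \Theta_{p'} \times \Phi_{q'} \subseteq \bigcup_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Theta_p \times \Phi_q.$$

Since $(\theta', \phi')$ was chosen arbitrarily,

$$\Psi_r^{OR} \subseteq \bigcup_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Theta_p \times \Phi_q.$$

Combining results we have set equality

$$\Psi_r^{OR} = \bigcup_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Theta_p \times \Phi_q.$$

Since $r \in [0,1]$ was chosen arbitrarily, these sets are equal for every $r \in [0,1]$. □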

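To form the ROC curve for the fused classification system, we want to maximize
$P_{TP}(A_\theta \vee B_\phi)$ and minimize $P_{FP}(A_\theta \vee B_\phi)$. Consider the constrained optimization problem
for every $r \in [0,1]$:

\begin{align*}
f_{A \vee B}(r) &= \max_{(\theta,\phi) \in \Psi_r^{OR}} P_{TP}(A_\theta \vee B_\phi) \\
&= \max_{(\theta,\phi) \in \Psi_r^{OR}} \big[ P_{TP}(A_\theta) + P_{TP}(B_\phi) - P_{TP}(A_\theta)\,P_{TP}(B_\phi) \big] \\
&= \max_{(\theta,\phi) \in \bigcup_{p+q-pq=r} \Theta_p \times \Phi_q} \big[ P_{TP}(A_\theta) + P_{TP}(B_\phi) - P_{TP}(A_\theta)\,P_{TP}(B_\phi) \big] \\
&= \max_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \; \max_{(\theta,\phi) \in \Theta_p \times \Phi_q} \big[ P_{TP}(A_\theta) + P_{TP}(B_\phi) - P_{TP}(A_\theta)\,P_{TP}(B_\phi) \big] \\
&= \max_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \; \max_{(\theta,\phi) \in \Theta_p \times \Phi_q} \big[ 1 - [1 - P_{TP}(A_\theta)]\,[1 - P_{TP}(B_\phi)] \big] \\
&= \max_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Big[ 1 - \min_{(\theta,\phi) \in \Theta_p \times \Phi_q} [1 - P_{TP}(A_\theta)]\,[1 - P_{TP}(B_\phi)] \Big] \\
&= \max_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Big[ 1 - \min_{\theta \in \Theta_p} [1 - P_{TP}(A_\theta)] \, \min_{\phi \in \Phi_q} [1 - P_{TP}(B_\phi)] \Big] \\
&= \max_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \Big[ 1 - \big[1 - \max_{\theta \in \Theta_p} P_{TP}(A_\theta)\big] \big[1 - \max_{\phi \in \Phi_q} P_{TP}(B_\phi)\big] \Big] \\
&= \max_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \big[ 1 - [1 - f_A(p)]\,[1 - f_B(q)] \big] \\
&= \max_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \big[ f_A(p) + f_B(q) - f_A(p)\,f_B(q) \big].
\end{align*}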


Note that $q$ is a function of $p$ such that $p + q - pq = r$. Solving for $q$ in terms of $p$ for $r$
fixed yields

$$q = Q(p) = \frac{r - p}{1 - p}$$

for $0 \le p \le r$. Therefore, an equivalent formula is

\begin{align*}
f_{A \vee B}(r) &= \max_{\substack{p,q \in [0,1] \\ p+q-pq=r}} \big[ f_A(p) + f_B(q) - f_A(p)\,f_B(q) \big] \\
&= \max_{p \in [0,r]} \big[ f_A(p) + f_B(Q(p)) - f_A(p)\,f_B(Q(p)) \big]. \tag{3.3}
\end{align*}

Now that we have a formula for the join [12], this motivates the creation of a new symbol
that represents this operation. Given $f, g \in \mathcal{F}$ we will write

$$[f \sqcup g](r) = \max_{p+q-pq=r} \big[ f(p) + g(q) - f(p)\,g(q) \big] \tag{3.4}$$

We read $f \sqcup g$ as “$f$ or $g$”. We use the symbol $\sqcup$ rather than $\vee$ in order to distinguish
operations on ROC curves from operations on classification systems [9].
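The operation in Equation 3.4 can be approximated numerically by sweeping $p$ over a grid on $[0, r]$ and setting $q = Q(p)$, as in Equation 3.3. The Python below is a minimal sketch of that idea and not the Matlab code used for the figures; the function name `or_fuse`, the grid size `n`, and the square-root test curve are assumptions made for illustration.

```python
import math

def or_fuse(f, g, n=2001):
    """Return r -> [f OR g](r) of Equation 3.4, approximated on n grid values of p.

    f and g are assumed to be ROC curves given as callables on [0, 1].
    """
    def fused(r):
        best = 0.0
        for i in range(n):
            p = r if i == n - 1 else r * i / (n - 1)      # p sweeps [0, r]
            if p >= 1.0:
                q = 1.0   # p = 1 satisfies p + q - pq = 1 for any q
            else:
                # q = Q(p), clamped to [0, 1] to guard against round-off
                q = min(1.0, max(0.0, (r - p) / (1.0 - p)))
            best = max(best, f(p) + g(q) - f(p) * g(q))
        return best
    return fused

if __name__ == "__main__":
    f = math.sqrt                          # illustrative curve, not from the thesis
    g = lambda p: math.tanh(4.0 * p)       # an illustrative tanh-shaped curve
    fused = or_fuse(f, g)
    for r in (0.1, 0.3, 0.5):
        print(f"r={r:.1f}  f={f(r):.4f}  g={g(r):.4f}  f or g={fused(r):.4f}")
```

Because the grid always contains the end points $p = 0$ and $p = r$, the approximation is never below either input curve at the same false positive rate. A finer grid tightens the approximation of the maximum at the cost of more evaluations.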

Next we will test to see if this operation satisfies the properties of a lattice.

1. (idempotent) Since $A \vee A = A$ it follows that

$$f_A \sqcup f_A = f_{A \vee A} = f_A$$

2. (commutativity) Testing for commutativity of the join,

$$f_A \sqcup f_B = f_{A \vee B} = f_{B \vee A} = f_B \sqcup f_A$$

By commutativity of the families with regard to the join $\vee$, the commutativity
of the join $\sqcup$ is satisfied.

3. (associativity) Observe that

$$(f_A \sqcup f_B) \sqcup f_C = f_{(A \vee B) \vee C} = f_{A \vee (B \vee C)} = f_A \sqcup (f_B \sqcup f_C)$$

4. (identity) Define $f_N(p) = 0$ for every $p \in [0,1]$. Then

$$f_A \sqcup f_N = f_A$$

5. (maximal element) Define $f_T(p) = 1$ for every $p \in [0,1]$. Then

$$f_A \sqcup f_T = f_T$$

3.3 AND Formula


We also will employ the development of Schubert [12], which develops the formula
for the AND of ROC curves. Consider the development of the probabilities of true and
false positive ($P_{TP}(A_\theta \wedge B_\phi)$ and $P_{FP}(A_\theta \wedge B_\phi)$, respectively) for the AND label-fusion rule
under the assumption of independence. Recall $\mathcal{L}_t = \{t\}$.

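\begin{align*}
P_{TP}(A_\theta \wedge B_\phi) &= P\big([A_\theta \wedge B_\phi]^{\natural}(\mathcal{L}_t) \mid \mathcal{E}_t\big) \\
&= P\big(A_\theta^{\natural}(\mathcal{L}_t) \cap B_\phi^{\natural}(\mathcal{L}_t) \mid \mathcal{E}_t\big) \\
&= P\big(A_\theta^{\natural}(\mathcal{L}_t) \mid \mathcal{E}_t\big)\, P\big(B_\phi^{\natural}(\mathcal{L}_t) \mid \mathcal{E}_t\big) \\
&= P_{TP}(A_\theta)\, P_{TP}(B_\phi). \tag{3.5}
\end{align*}

\begin{align*}
P_{FP}(A_\theta \wedge B_\phi) &= P\big([A_\theta \wedge B_\phi]^{\natural}(\mathcal{L}_t) \mid \mathcal{E}_n\big) \\
&= P\big(A_\theta^{\natural}(\mathcal{L}_t) \cap B_\phi^{\natural}(\mathcal{L}_t) \mid \mathcal{E}_n\big) \\
&= P\big(A_\theta^{\natural}(\mathcal{L}_t) \mid \mathcal{E}_n\big)\, P\big(B_\phi^{\natural}(\mathcal{L}_t) \mid \mathcal{E}_n\big) \\
&= P_{FP}(A_\theta)\, P_{FP}(B_\phi). \tag{3.6}
\end{align*}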

Let $p$ be a value for the probability of false positive for classifier $A_\theta$; then $f_A(p)$ is the value
of the probability of true positive for classifier $A_\theta$. Similarly, let $q$ be a value for the
probability of false positive for classifier $B_\phi$; then $f_B(q)$ is the value of the probability of
true positive for classifier $B_\phi$. Let $r$ be a value for the probability of false positive for the
fused classifier $A_\theta \wedge B_\phi$; then $f_{A_\theta \wedge B_\phi}(r)$ is the value of the probability of true positive for
classifier $A_\theta \wedge B_\phi$. Define the sets

$$\Theta_p = \{\theta \in \Theta : P_{FP}(A_\theta) = p\}$$

$$\Phi_q = \{\phi \in \Phi : P_{FP}(B_\phi) = q\}$$

$$\Psi_r^{AND} = \{(\theta,\phi) \in \Theta \times \Phi : P_{FP}(A_\theta \wedge B_\phi) = r\}.$$

Theorem 3. For every $r \in [0,1]$,

$$\Psi_r^{AND} = \bigcup_{\substack{p,q \in [0,1] \\ pq=r}} \Theta_p \times \Phi_q.$$

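Proof. Choose $r \in [0,1]$ and let it be fixed. Let

$$(\theta', \phi') \in \bigcup_{\substack{p,q \in [0,1] \\ pq=r}} \Theta_p \times \Phi_q.$$

Then $\theta' \in \Theta_{p'}$ and $\phi' \in \Phi_{q'}$ for some $p', q' \in [0,1]$ such that $p'q' = r$. From the definitions
of $\Theta_p$ and $\Phi_q$ we see that

$$p' = P_{FP}(A_{\theta'}), \qquad q' = P_{FP}(B_{\phi'}),$$

which, by Equation 3.6, implies

$$P_{FP}(A_{\theta'} \wedge B_{\phi'}) = P_{FP}(A_{\theta'})\,P_{FP}(B_{\phi'}) = p'q' = r$$

and therefore $(\theta', \phi') \in \Psi_r^{AND}$. Since $(\theta', \phi')$ was arbitrary, and hence so were $p', q' \in [0,1]$, then

$$\Psi_r^{AND} \supseteq \bigcup_{\substack{p,q \in [0,1] \\ pq=r}} \Theta_p \times \Phi_q.$$

Let $(\theta', \phi') \in \Psi_r^{AND}$; then $P_{FP}(A_{\theta'} \wedge B_{\phi'}) = r$. For some $p', q' \in [0,1]$ such that $\theta' \in \Theta_{p'}$ and
$\phi' \in \Phi_{q'}$, we observe that

$$p'q' = P_{FP}(A_{\theta'})\,P_{FP}(B_{\phi'}) = P_{FP}(A_{\theta'} \wedge B_{\phi'}) = r,$$

which implies

$$(\theta', \phi') \in \Theta_{p'} \times \Phi_{q'} \subseteq \bigcup_{\substack{p,q \in [0,1] \\ pq=r}} \Theta_p \times \Phi_q.$$

Hence

$$\Psi_r^{AND} \subseteq \bigcup_{\substack{p,q \in [0,1] \\ pq=r}} \Theta_p \times \Phi_q.$$

Combining set containments yields set equality. Since $r \in [0,1]$ was chosen arbitrarily, then

$$\Psi_r^{AND} = \bigcup_{\substack{p,q \in [0,1] \\ pq=r}} \Theta_p \times \Phi_q$$

for all $r \in [0,1]$. □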


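To form the ROC curve for the fused classification system, we want to maximize
$P_{TP}(A_\theta \wedge B_\phi)$ and minimize $P_{FP}(A_\theta \wedge B_\phi)$. Consider the constrained optimization
problem for every $r \in [0,1]$:

\begin{align*}
f_{A \wedge B}(r) &= \max_{(\theta,\phi) \in \Psi_r^{AND}} P_{TP}(A_\theta \wedge B_\phi) \\
&= \max_{(\theta,\phi) \in \Psi_r^{AND}} P_{TP}(A_\theta)\,P_{TP}(B_\phi) \\
&= \max_{(\theta,\phi) \in \bigcup_{pq=r} \Theta_p \times \Phi_q} P_{TP}(A_\theta)\,P_{TP}(B_\phi) \\
&= \max_{\substack{p,q \in [0,1] \\ pq=r}} \Big[ \max_{(\theta,\phi) \in \Theta_p \times \Phi_q} P_{TP}(A_\theta)\,P_{TP}(B_\phi) \Big] \\
&= \max_{\substack{p,q \in [0,1] \\ pq=r}} \Big[ \max_{\theta \in \Theta_p} P_{TP}(A_\theta) \Big] \Big[ \max_{\phi \in \Phi_q} P_{TP}(B_\phi) \Big] \\
&= \max_{\substack{p,q \in [0,1] \\ pq=r}} f_A(p)\,f_B(q). \tag{3.7}
\end{align*}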
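The AND formula above can be approximated in the same way as the OR formula: since $pq = r$ with $p, q \in [0,1]$ forces $p \in [r, 1]$, one can sweep $p$ over a grid on $[r, 1]$ and set $q = r/p$. The Python below is a minimal sketch under that reading, assuming nondecreasing ROC curves; the function name `and_fuse` and the grid size `n` are hypothetical choices, not part of [12].

```python
import math

def and_fuse(f, g, n=2001):
    """Return r -> max over pq = r of f(p) * g(q), approximated on n grid values of p.

    f and g are assumed to be nondecreasing ROC curves given as callables on [0, 1].
    """
    def fused(r):
        best = 0.0
        for i in range(n):
            # p sweeps [r, 1]; the end points are hit exactly
            p = 1.0 if i == n - 1 else min(1.0, r + (1.0 - r) * i / (n - 1))
            # q = r / p. When p = 0 (possible only if r = 0) every q is feasible,
            # so take q = 1, which is best for a nondecreasing g.
            q = min(1.0, r / p) if p > 0.0 else 1.0
            best = max(best, f(p) * g(q))
        return best
    return fused

if __name__ == "__main__":
    f = math.sqrt                          # illustrative curve, not from the thesis
    g = lambda p: math.tanh(4.0 * p)       # an illustrative tanh-shaped curve
    fused = and_fuse(f, g)
    for r in (0.1, 0.3, 0.5):
        print(f"r={r:.1f}  f={f(r):.4f}  g={g(r):.4f}  f and g={fused(r):.4f}")
```

A convenient sanity check is the case $f(p) = g(p) = \sqrt{p}$: every feasible pair gives $\sqrt{p}\sqrt{q} = \sqrt{r}$, so the fused curve must again be $\sqrt{r}$.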


Next we will test to see if this operation, which we denote by $\sqcap$, satisfies the properties of a lattice.


1. (idempotent) Since $A \wedge A = A$ it follows that

$$f_A \sqcap f_A = f_{A \wedge A} = f_A$$

2. (commutativity) Testing for commutativity of the meet,

$$f_A \sqcap f_B = f_{A \wedge B} = f_{B \wedge A} = f_B \sqcap f_A$$

By commutativity of the reals with regard to multiplication, the commutativity of the
meet is satisfied.

3. (associativity) Observe that

$$(f_A \sqcap f_B) \sqcap f_C = f_{(A \wedge B) \wedge C} = f_{A \wedge (B \wedge C)} = f_A \sqcap (f_B \sqcap f_C)$$

4. (identity)

$$f_A \sqcap f_T = f_A$$

5. (minimal element)

$$f_A \sqcap f_N = f_N$$

3.4 Epimorphism

Our main result is the following theorem [9].

Theorem 4. Let $\mathscr{C} = \{A^{(1)}, \ldots, A^{(K)}\}$ be a collection of $K$ families of total classification
systems that are mutually independent. Let $(\mathrm{CSF}(\mathscr{C}), =, \wedge, \vee)$ denote the Lattice of total,
independent classification system families generated by $\mathscr{C}$. Let $\mathcal{F} = \{f_{A^{(1)}}, f_{A^{(2)}}, \ldots, f_{A^{(K)}}\}$
be the collection of $K$ ROC curves corresponding to $\mathscr{C}$. Then $(\mathrm{ROC}(\mathcal{F}), =, \sqcap, \sqcup)$ is a
Lattice of ROC curves that is epimorphic to $(\mathrm{CSF}(\mathscr{C}), =, \wedge, \vee)$.

Proof. Define the mapping

$$F : (\mathrm{CSF}(\mathscr{C}), =, \wedge, \vee) \to (\mathrm{ROC}(\mathcal{F}), =, \sqcap, \sqcup)$$

to be

$$F(A) = f_A.$$

Then it is clear that

$$\mathrm{Dom}(F) = \mathrm{CSF}(\mathscr{C})$$

and

$$F(A \vee B) = F(A) \sqcup F(B)$$

while

$$F(A \wedge B) = F(A) \sqcap F(B);$$

then it is clear that

$$\mathrm{Ran}(F) = \mathrm{ROC}(\mathcal{F}).$$

If $A \ne B$ but $f_A = f_B$ then $F$ is not one-one. Thus, the lattices $\mathrm{CSF}(\mathscr{C})$ and $\mathrm{ROC}(\mathcal{F})$
are epimorphic. □

IV. Examples


The  best  way  to  show  the  power  of  label  fusion  is  to  incorporate  formulas  3.7  and  3.3  for 
AND  and  OR  into  some  Matlab  code  and  visually  evaluate  the  results. 

Suppose we have three ROC curves from classification systems A, B, and C, with the
curves defined as follows:

$$f_A(p) = \begin{cases} 2p, & 0 \le p \le 0.4 \\ p/3 + 2/3, & 0.4 < p \le 1.0 \end{cases}$$

$$f_B(p) = \tanh(4p)$$

$$f_C(p) = p^{1/3}$$

The  following  plots  will  show  how  these  ROC  curves  combine,  and  how  the  optimal 
fusion  rules  result  in  better  performance  overall. 

4.1 Two Classifiers

With  only  two  systems,  fusion  can  either  be  the  AND,  or  the  OR.  Which  one  is 
better  is  dependent  on  the  shape  of  the  original  ROC  curves. 
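As a rough stand-in for the plots, the comparison between the AND and the OR of systems A and C can be tabulated numerically. The Python below is a sketch only: it is not the Matlab code that produced the figures, the grid-search helpers `or_curve` and `and_curve` are hypothetical names, and the piecewise end points of $f_A$ are taken as $0 \le p \le 0.4$ and $0.4 < p \le 1$, which is an assumption about the original inequalities.

```python
import math

def f_A(p):
    """Piecewise-linear example curve; both pieces meet at p = 0.4."""
    return 2.0 * p if p <= 0.4 else p / 3.0 + 2.0 / 3.0

def f_B(p):
    """Example curve tanh(4p); defined for completeness, unused below."""
    return math.tanh(4.0 * p)

def f_C(p):
    """Example curve p^(1/3)."""
    return p ** (1.0 / 3.0)

def or_curve(f, g, r, n=1001):
    """Equation 3.3 on a grid: max over p in [0, r] with q = (r - p) / (1 - p)."""
    best = 0.0
    for i in range(n):
        p = r if i == n - 1 else r * i / (n - 1)
        q = 1.0 if p >= 1.0 else min(1.0, max(0.0, (r - p) / (1.0 - p)))
        best = max(best, f(p) + g(q) - f(p) * g(q))
    return best

def and_curve(f, g, r, n=1001):
    """AND formula on a grid: max over p in [r, 1] with q = r / p (r > 0 assumed)."""
    best = 0.0
    for i in range(n):
        p = 1.0 if i == n - 1 else min(1.0, r + (1.0 - r) * i / (n - 1))
        q = min(1.0, r / p)
        best = max(best, f(p) * g(q))
    return best

if __name__ == "__main__":
    print("   r     f_A     f_C   A or C  A and C")
    for r in (0.05, 0.1, 0.2, 0.4, 0.8):
        print(f"{r:4.2f}  {f_A(r):6.4f}  {f_C(r):6.4f}  "
              f"{or_curve(f_A, f_C, r):6.4f}  {and_curve(f_A, f_C, r):6.4f}")
```

Each row gives the true positive rate of the individual and fused curves at the same false positive rate $r$, so the two fusion rules can be compared point by point along the curve.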

4.2 Three Classifiers

Using only ANDs and ORs from the Lattice, that is, the monotonic combinations,
we can see that combining classifiers has benefits which range across the total ROC curve.
It should be noted in Figure 4.5 that the majority vote delivers remarkably good results.

Figure 4.1: ROC Curves of Classification System Families A and C.

Figure 4.2: ROC Curves of Classification Systems A and C, Label Fused.

Figure 4.3: ROC Curves of Classification System Families A, B, and C.

Figure 4.4: ROC Curves of Classification Systems A, B, and C, Label Fused.

Figure 4.5: ROC Curves of Classification Systems A, B, and C with Majority Rule.

V. Conclusions


5.1  Summary 

We  began  this  thesis  discussing  the  need  for  optimal  target  detection  arising  in  many 
different  fields.  It  was  proposed  that  the  combination  of  multiple  classification  systems 
might  lead  to  more  optimal  target  detection  performance.  We  restricted  our  study  to 
the  label  fusion  of  multiple  independent  two-label  classification  systems.  By  doing  so 
we  were  able  to  quantify  the  performance  of  their  combination  in  the  same  way  as  is 
done  for  the  individual  systems;  by  means  of  the  ROC  curve.  Provided  our  assumptions 
were  maintained,  we  found  that  the  ROC  curve  of  two  classification  systems  which  were 
AND’ed  was  onto  the  meet  of  the  individual  ROC  curves.  We  found  the  same  result  for 
the  ROC  of  the  OR’ed  classification  systems  and  the  ROC  of  their  join.  These  results  were 
checked  against  a  list  of  properties  of  binary  operators  to  convince  us  that  we  were  working 
with  lattices,  and  that  the  Lattice  of  Classification  System  Families  is  epimorphic  to  the 
Lattice  of  ROC  curves.  We  applied  the  label  fusion  techniques  to  some  examples  to  show 
visually  how  increased  ROC  performance  may  be  achieved  by  the  optimal  combination  of 
classification  systems.  The  majority  vote  was  a  stellar  performer  for  the  systems  we  chose  to 
test.  What  we  did  not  show,  and  possibly  cannot  show,  is  that  a  given  ROC  curve  is  one-one 
with  the  classification  system  which  produced  it.  Since  a  ROC  curve  is  a  measure  of  how 
well  a  classification  system  is  performing,  it  may  not  be  unique,  as  multiple  classification 
systems might enjoy the same measure of performance. So we stopped short of claiming
that the lattice of ROC curves is isomorphic to the lattice of classification systems. Even
though our developments only established a surjective relationship between classification
system families and their ROC curves, it still was and is a noteworthy accomplishment.
Given the enormous cost of building and testing combinations of classification systems,
and generating their ROC performance, testing the lattice of ROC curves in software with
existing individual ROC curves can represent a significant cost savings in the design of
optimal classification systems.

5.2 Future Work


It  may  be  possible  to  show,  and  it  may  be  worth  proving,  that  the  ROC  curves  of 
optimal  combinations  of  classification  systems  are  unique,  allowing  us  to  assert  that  there 
exists  an  isomorphism  between  the  lattice  of  ROC  curves  and  the  lattice  of  classification 
systems.  This  might  prove  useful  when  attempting  to  reverse  engineer  what  went  into  each 
classification  system  given  its  ROC  performance. 

It  might  also  be  worthwhile  to  prove  that  the  lattice  of  ROC  curves  is  indeed  a 
distributive  lattice.  This  would  require  that  it  satisfy  the  distributive  property.  We  were 
not  able  to  verify  this. 

We briefly discussed some different ways of evaluating the ROC performance of
a given combination of classification systems. Area Under the Curve (AUC), Neyman-Pearson,
and Bayesian Risk were examples. If a functional could be defined that could
predict optimal ROC performance without having to form the set of ROC curves generated
by the lattice of classification system families, a further cost savings might be achieved by
reducing the amount of computation needed to deliver optimal performance.